agent: add membw command for bare-metal DDR bandwidth test #100
Merged
Adds CMD_MEMBW (0x0D) that runs memset / read-scan / memcpy via ARM ldmia/stmia kernels (r4-r11, 32 B per loop iter) against a scratch DDR buffer with the MMU cache on, timed using the ARMv7 PMU cycle counter (CCNT). Reports cycles/byte (CPU-clock-invariant, the metric that actually isolates DDR fabric) plus MB/s when the architectural generic timer's CNTFRQ is set by an earlier boot stage.

Motivating use case (issue): comparing OpenIPC vs vendor U-Boot on the same gk7205v300 silicon to determine whether an encoder fps gap comes from DDR fabric or from the Linux software stack. cycles/byte from membw answers that without any of the userspace cache-attr / CMA / libc-memcpy confounders.

Bumps AGENT_VERSION to 4 and advertises CAP_MEMBW in INFO so the host can check support. ARMv7 only (V4 / V5 / V6 family); ARMv5 (ARM926, hi3516cv300) cleanly rejects with ACK_FLASH_ERROR: different PMU register layout, out of scope for the motivating case.

Default scratch is placed at LOAD_ADDR + 8 MiB (passed in via the new AGENT_LOAD_ADDR macro from the Makefile) with a guard that rejects any user-supplied addr where [addr, addr + 2*size) overlaps [LOAD_ADDR - 64KB, LOAD_ADDR + 8 MiB]; otherwise an 8 MiB memcpy on the default V4 layout would stomp the running agent's own code.

Validated end-to-end on hi3516ev300 + gk7205v300: cycles/byte stable to 0.2% across 4 MiB×8, 8 MiB×16, 16 MiB×8 runs, matching across both SoCs (same V4 silicon family). CNTFRQ is 0 on both (the bootrom doesn't initialise the generic timer), so the cycles/byte fallback path is the one exercised in practice.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
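The relationship between the CCNT cycle count, cycles/byte, and the optional MB/s figure can be sketched in Python as follows. This is an illustrative sketch, not the code in this PR: the function names, the `traffic_factor` parameter, and the choice of 1 MB = 10^6 bytes are assumptions made for the example.

```python
from typing import Optional


def cycles_per_byte(cycles: int, size_bytes: int, iters: int,
                    traffic_factor: int = 1) -> float:
    """CCNT cycles spent per byte of traffic.

    traffic_factor is a hypothetical knob: 1 for memset or read-scan,
    2 if memcpy is counted as one read plus one write per byte.
    """
    total_bytes = size_bytes * iters * traffic_factor
    if total_bytes <= 0:
        raise ValueError("size_bytes, iters and traffic_factor must be positive")
    return cycles / total_bytes


def cpu_hz_from_calibration(ccnt_delta: int, cntpct_delta: int,
                            timer_hz: int) -> Optional[float]:
    """Derive the CPU clock from a calibration window.

    ccnt_delta and cntpct_delta are the CCNT and CNTPCT increments
    observed over the same window. Returns None when the generic timer
    was never programmed (timer_hz == 0), which is the fallback case.
    """
    if timer_hz == 0 or cntpct_delta == 0:
        return None
    return ccnt_delta * timer_hz / cntpct_delta


def mbps(cycles: int, size_bytes: int, iters: int,
         cpu_hz: Optional[float], traffic_factor: int = 1) -> Optional[float]:
    """Throughput in MB/s (assumed 10**6 bytes), or None without a CPU clock."""
    if cpu_hz is None or cycles <= 0:
        return None
    seconds = cycles / cpu_hz
    return size_bytes * iters * traffic_factor / seconds / 1e6


if __name__ == "__main__":
    # Made-up numbers, only to exercise the arithmetic.
    hz = cpu_hz_from_calibration(9_000_000, 240_000, 24_000_000)
    print(cycles_per_byte(64 * 2**20, 4 * 2**20, 8))
    print(mbps(900_000_000, 1_000_000, 100, hz))
    print(mbps(900_000_000, 1_000_000, 100,
               cpu_hz_from_calibration(9_000_000, 240_000, 0)))
```

With `timer_hz == 0` the MB/s helper returns `None`, so a caller would report only cycles/byte, which is the path the validation runs exercised.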
Summary
Closes #99.
- `CMD_MEMBW` (0x0D) / `RSP_MEMBW` (0x87): runs memset / read-scan / memcpy kernels via ARM `ldmia`/`stmia` (r4-r11, 32 B per loop iter) against a scratch DDR buffer with the MMU cache on, timed with the ARMv7 PMU cycle counter (CCNT).
- `defib agent membw [--size 4MB] [--iters 8] [--addr 0] [--port ...] [--output human|json]` CLI command.
- Reports `cycles/byte` (CPU-clock-invariant, the metric that actually isolates DDR fabric from CPU clock variance) plus `MB/s` when the architectural generic timer's `CNTFRQ` is set by an earlier boot stage. If `CNTFRQ == 0` the host transparently falls back to cycles/byte.
- Bumps `AGENT_VERSION`, advertises `CAP_MEMBW` in INFO so the host can check support before sending the command.
- ARMv7 only (V4 / V5 / V6 family); ARMv5 (ARM926, `hi3516cv300`) cleanly rejects with `ACK_FLASH_ERROR` via `#ifdef CPU_ARM926`: different PMU register layout, out of scope for the motivating use case.

Why
From the issue: when investigating an encoder fps gap between OpenIPC and vendor firmware on identical `gk7205v300` silicon, the key question ("is the DDR fabric slow, or is Linux slow on top of it?") can't be cleanly answered from inside Linux. CMA reservations, cache attributes, libc memcpy variance and scheduler noise all muddy any userspace number.

defib already runs a bare-metal agent in DDR right after SPL brings memory up. That's the exact moment we want to measure raw DDR throughput, before any kernel/ISP/VENC traffic. `defib agent membw` gives a reproducible apples-to-apples bandwidth number per firmware.

How
Agent C (`agent/main.c`, `agent/protocol.h`)

Three inline-asm kernels with `ldmia`/`stmia` over r4-r11 (8 words = 32 B per memory operation), so OpenIPC vs vendor builds produce identical instruction streams. Cache is on (write-back / write-allocate per `startup.S` page-table fill); the buffer is sized well above L1+L2 so DDR is the actual bottleneck.

CCNT is calibrated against `CNTPCT` (architectural generic timer, fixed frequency from `CNTFRQ`) over a 10 ms window. If `CNTFRQ` was never written by the bootrom (and on the V4 family it isn't), the agent returns `timer_hz = 0` and the host falls back to the cycles/byte metric. That number alone already answers the original question because it normalises for CPU-clock differences across firmwares, which is the gotcha that bit the reporter in the original investigation.

Agent footprint guard
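The footprint guard boils down to an interval-overlap test. A minimal Python sketch follows, under stated assumptions: the agent implements this in C inside `handle_membw`, whereas the names here (`scratch_is_safe`, `ranges_overlap`), the example load address, and the treatment of the protected range as half-open (so that the default scratch at `LOAD_ADDR + 8 MiB` itself passes) are hypothetical. 32-bit address wrap-around is not modelled.

```python
KIB = 1024
MIB = 1024 * KIB

GUARD_BELOW = 64 * KIB   # margin kept free below the load address
GUARD_ABOVE = 8 * MIB    # region kept free from the load address upward


def ranges_overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if half-open ranges [a_start, a_end) and [b_start, b_end) intersect."""
    return a_start < b_end and b_start < a_end


def scratch_is_safe(addr: int, size: int, load_addr: int) -> bool:
    """Check that [addr, addr + 2*size) stays clear of the agent footprint.

    The 2*size span follows the PR text; presumably it covers both the
    source and the destination halves used by the memcpy kernel.
    """
    return not ranges_overlap(
        addr, addr + 2 * size,
        load_addr - GUARD_BELOW, load_addr + GUARD_ABOVE,
    )


if __name__ == "__main__":
    load = 0x4100_0000  # illustrative value only, not taken from the PR
    print(scratch_is_safe(load + 8 * MIB, 8 * MIB, load))   # default scratch
    print(scratch_is_safe(load, 4 * MIB, load))             # on top of the agent
    print(scratch_is_safe(load - 16 * MIB, 8 * MIB, load))  # runs into the agent
```

The third call shows why the end of the test range matters: a buffer that starts well below the agent can still reach it once the `2*size` span is taken into account.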
The default scratch sits at `LOAD_ADDR + 8 MiB` (a new `AGENT_LOAD_ADDR` macro is passed in via Makefile `CFLAGS`). `handle_membw` rejects any user-supplied `addr` whose `[addr, addr + 2*size)` range overlaps `[LOAD_ADDR - 64 KB, LOAD_ADDR + 8 MiB]`; otherwise an 8 MiB memcpy on the default V4 layout would stomp the running agent's own code. This was found during real-hardware testing; see the validation section.

Python host (`src/defib/agent/client.py`, `cli/app.py`)

- `MembwResult` dataclass with `cycles_per_byte(ticks, write_amp=1)` and `mbps(ticks, write_amp=1)` helpers (returns `None` for `mbps` when `timer_hz == 0`).
- `FlashAgentClient.membw(size_bytes, iters, addr)` async method.
- `defib agent membw` Typer command with `human` and `json` output modes.
- `agent info` now lists `membw` in the capabilities line when reported by the agent.

Tests
- `agent/test_agent.c`: round-trip framing tests for the 12 B request and 32 B response packets.
- `tests/test_agent_protocol.py::TestMembw`: four tests using `MockTransport`, covering field parsing, MB/s + cycles/byte math, `timer_hz == 0` graceful degradation, and the ARMv5 (`ACK_FLASH_ERROR`) rejection path.

Validation
Real hardware, 2026-05-14:
Both SoCs agree to 0.2%, as expected for the same V4 silicon family with the same DDR config. `CNTFRQ == 0` on both, so MB/s shows `n/a` and the cycles/byte fallback activates automatically.

Tests / lint / cross-build (all green):

- `make -C agent test HOST_CC=gcc`: 5412/5412 pass (includes 2 new framing tests)
- `uv run pytest tests/ -x --ignore=tests/fuzz`: 494 pass, 2 skip (includes 4 new TestMembw tests)
- `uv run ruff check src/ tests/`: clean
- `uv run mypy src/defib/ --ignore-missing-imports`: clean
- Cross-build: `gk7205v300`, `hi3516ev300`, `hi3516cv300` (ARMv5 reject path), `hi3516cv610`; `make all-socs` builds all four default targets.

Test plan
- `make -C agent test HOST_CC=gcc`
- `uv run pytest tests/`
- OpenIPC vs vendor firmware on the same `gk7205v300` silicon: diff `cycles_per_byte`, which is the motivating measurement the issue was asking for.

🤖 Generated with Claude Code